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I. BACKGROUND AND THEORY 


A. MOTIVATION FOR DEVELOPING A NEW QUERY METHOD 


Traditional sequential methods of dealing with statis- 
tical queries on numerical databases involve the inspection 
of each data element within the database to determine if 
that element satisfies the queried condition. For a large 
database of thousands of items, such a Search results in 
poor response time and tremendous use of machine resources 
regardless of the size of the queried set with respect to 
the size of the database. Even when indexing techniques are 
employed to provide more direct access to indexed data items 
within the database, query resfonse time can be unsatisfac- 
toriiy slow when the database is large andis_ located on 
humerous pageS ina virtual paging systen. The primary 
reason for this slow response is that if the database 1S 
randomly distributed on the gqueried attribute the data items 
required by the query will be dispersed throughout the data- 
base and many of the fages containing the database will have 
to be transferred into primary storage to retrieve the data. 
For databases which are extremely large, such as the U. S. 
Census Bureau database, such a delay can be extremely long 
and guite unsatisfactory. One solution to this problen, 
suggested in [Ref. 1], involved the use of an additional 
file of information about the database called the database 
abstract. Indeed, Rowe also suggested that access to the 
database itself might even be unnecessary if sufficient 
information could be stored in the database abstract to 


answer the user's queries within a satisfactory tolerance. 
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B. DEFINITION OF A DATABASE ABSTRACT 


The concept of a database abstract should not be that 
unfamiliar to the reader as it iS guite Similar to that of a 
mathematical equivalence class. Recall that in mathematics 
when an equivalence relation is defined over a domain, all 
elements which are equivalent to each other can be grouped 
into a set called an equivalence class. If the original 
domain were a group (in the mathematical sense) closed under 
a certain operation, then the collection of equivalence 
classes defined by the given eguivalence relation will also 
be a group (called the guotient group), when a new operation 
1s appropriately defined in terms of the original operation. 
In the context of a database, the eguivalence relation 
appears in the form of a partition of the database on one of 
its attributes. Ali data items which are within a partic- 
ular partition with respect tc the partitioning attribute 
are considered to be eguivalent and thus the database 
abstract 1S Similar to a ‘quotient database’. As such the 
database abstract contains information about the equivalence 
classes rather than the individual data items. The most 
difficult guestion resulting from this approach is the 
problem of how to capture as much of the information 
contained in the individual data items of the equivalence 
class aS possible, in the most compact storage area 
possible, when tabulating the information to be recorded 
about the equivalence class. 

For the purposes of this paper, we will consider a 
sample database of medical information about patients ina 
particular hospital. Each data item consists of a six-tuple 
containing the patient number, sex, disease activity level, 
temperature, cholesterol level and prednisone level ofa 
particular patient. Thus each data element in this database 


contains six attributes. If we define a partition of the 
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database on the sex attribute, we obtain two eguivalence 
classes, one containing all tuples with male for the value 
of the sex attribute and the other all tuples with female 
for the value of the sex attribute. Our new database 
abstract contains only two data items. Obviously we cannot 
record a disease activity, temperature, patient number, etc 
for the equivalence classes because they are a different 
type of entity than the tuples of which they are composed. 
We can, however, record the mean value (just as an example) 
for disease activity, temperature, etc. for all the tuples 
in each equivalence class and still retain some portion of 
the information which is contained in the individual tuples. 

This is the basic idea of the database abstract except 
that the database abstract utilizes multiple equivalence 
relations to formulate orthogonal eguivalence classes over 
each of the attributes in the relation. For example, in 
addition to the partition discussed above over the _ sex 
attribute, the database can be partitioned over the tempera- 
ture attribute. All patients with temperatures less than 
35.5 are considered to have low temperatures, those between 
35.5 and 36.5 inclusive to have normal temperatures’ and 
those above 36.5 to have high temperatures. Under this 
equi alence relation, the database is partitioned into three 
parts. All tuples in each partition are eguivalent with 
respect to this relation. We can record statistical infor- 
Mation about all the equivalent tuples in each partition and 
attempt to capture as much information about these tuples as 
possible. In like manner the database can be partitioned 
over each of its attributes whether they be numerical or 
non-numerical and thus the final database abstract can be 
thought of as an crthogonal collection of equivalence 
classes with certain selected Statistical information 
recorded about each equivalence class inan attempt to 


Ccaptute many of the characteristics of the database ina 


12 


much smaller storage area than the actual database would 
contain. 

Even though attributes may be partitioned into equiva- 
lence classes regardless of whether they are numerical or 
non-numerical, there is an inherent difference in the flexi- 
bility of the partitioning process between these two types 
of attributes. For instance, in our example above 
concerning the temperature attribute, we may just as easily 
have decided to divide the database into low temperatures, 
low normal, high normal and high temperatures instead of the 
three partitions defined above. This new partitioning would 
Simply involve a redefinition of the boundaries of the 
partitions. In fact, the database could be divided into any 
number of partitions between two andthe number of distinct 


values for temperature in the database depending upon the 


desires of the user. Now consider the non-numeric attri- 
bute, sex. It must be partitioned one and only one way 
because it has only two distinct values. Therefore during 


this partitioning process, we must handle numeric and non- 
numeric attributes ina slightly different manner. 

The above discussion has raised two important guestions 
which we Must consider in constructing the database 
abstract. PiGst, for the numeric attributes, how many 
partitions should be chosen for each attribute to maximize 
the amount of information about the original database which 
is captured in the database abstract while still minimizing 
the storage space reguired to store the database abstract. 
This question will be analyzed in considerable depth in 
chapter three. The second question concerns what informa- 
tion should be recorded in the database abstract about each 
equivalence class in crder to again maximize the real infor- 
Mation content while still keeping the size of the database 


abstract as small as fossible. 
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C. THE INFORMATION CONTENT OF THE DATABASE ABSTRACT 


In [Ref. 1], that author included 16 statistics for each 
eguivalence class in his database abstract in order to be 
able to service a large number of different gueries on 
various’ statistics. In this discussion, we restrict 
ourselves to gueries on the size statistic in order to more 
eaSily direct our attention to other questions about’ the 
formulation of the database abstract. Here the size 
statistic refers to the number of database tuples which fall 
into a particular set defined by the user. The barious 
methods to define these query sets will be discussed later 
in this chapter. For the information content of the equiva- 
lence classes constituting our database abstract, we have 
used the five levels of knowledge defined in [Ref. 2}. 
These levels represent increasing quantities of information 
and storage requirements and result in increasing accuracy 
of query estimation results. 

As discussed in [Ref. 2], the first level of knowledge 
contains only the size of each equivalence class. Therefore 


this database abstract would contain the size of each parti- 


tion of each attribute in the database. The total informa- 
tion content wou ‘! deperd on tne number of partitions chosen 
for each of the .imerical attributes in the database. The 


second level of knowledge contains in addition to the size 
InTPOE Nat Lon, the mode frequency and number of distinct 
values of each attribute in every equivalence class in the 
database. Therefore, if each of the five numerical attri- 
butes in our medical database were divided into four parti- 
tions and the non-numerical sex attribute into ‘twe 
partitions this database abstract would contain 2* (4*5*5 + 
2*5) or 220 additional data items. This level of knowledge 
is equivalent to that referred to as level 2a in [Ref. 2]. 


The level three database abstract EOE Stns systen 
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Berresponds to the level 3a abstract of [{Ref. 2] and 
includes the frequency of the second most frequent value and 
the frequency of the median frequency value for each attri- 
bute in each equivalence class and therefore adds an addi- 
tional 220 data items to the database abstract file. Level 
four is a logical extension of levels two and three and 
contains a list of the frequency of the most freguent 
through the kth most freguent value for each attribute in 
each equivalence class in the database where k represents 
the number of distinct values of the given attribute in that 
equivalence class. Finally, the highest level of knowledge 
is the tagged frequency distribution list of level five. 
This database abstract contains a list of the values of each 
attribute in each equivalence class together with the 
frequency of occurrence of that value in the given equiva- 
lence class. This is the highest level of knowledge in the 
context of this hierarchy because it represents the size of 
all first level intersections cf all sets in the database 
and as such corresponds to level one knowledge if we had 
considered these first level intersections as the primitive 
sets in the database. 

The choice of these particular items of information is 
not a hard and fast requirement for a system of this type. 
Other items of information may be substituted for those 
included above. For example, instead of the second nost 
frequent value in level three, the least frequent or anti- 
mode frequency of each attribute in each equivalence class 
could be chosen. The important requirement for the purposes 
of this system is that the levels of knowledge represent a 
sequence of increasing information content in the database 
so that we can demonstrate an increasing accuracy of query 
estimate results and weigh these benefits against a greater 
storage requirement and a decreasing speed of execution as 


higher level database abstracts and their correspondingly 


15 


more complex rul ; are u :d. Do tdct, cr information 
bes 2s frequenc inform tion could concei. Ly be used to 
capture the information content of each ec _valence class, 
however the frequency information discussed above seems to 
fit more appropriately into a well defined hierarchy than 
any other types of information considered. 

Now that the information ccntent of the various levels 
of knowledge has been described, an appropriate operation on 
that information must be defined in order to gain the best 
possible estimate of the size of any gueried set. Note that 
the basic operation cr the original database in order to 
answer a query is the « ration of search and comparison to 
the given i Sut conditi s. RFecause the database abstract 
contains a more abstra and concise representation of the 
information content of the original database, the basic 
search operation will no longer suffice. A new operation 
must be defined in terms of the base operation in a well 
defined manner to produce accurate estimates to user's 
queries in much the same manner that a new operation is 
defined on a quotient group in terms of the original opera- 
tao nr. The term operation is used quite loosely in this 
context be use it actually takes the form of a database of 
rules whi 2xtract the queried information “rom the infor- 
Mation concent of the database abstract. Tne search and 
comparison operation is used to locate and apply the appro- 
priate rule for a given set of input conditions in the forn 
of a guery. The following chafter will describe this data- 
base abstract operation by describing the ru’?s used in the 
prototype system constructed to implement suc.. a query esti- 


mation systen. 
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D. HYPOTHESES TO BE CONSIDERED 


The remainder of this paper will describe a system which 
calculates upper and lower bound and estimates for the size 
of a given set uSing only the information contained in the 
database abstract. Various characteristics of the database 
abstracts will be varied and the system will be tested to 
determine what effect these characteristics have on the 
system responses. At this fpoint itis appropriate to 
hypothesize about the expected effects of varying these 
characteristics. 

First of all it is expected that uSing higher levels of 
knowledge will produce tighter upper and lower bounds but 
will also result in increased system response times and 
greater storage requirements for the database abstract. Tee 
is also suspected that there is one level which is the best 
to use in the sense that it causes much tighter upper and 
lower bounds than the next lower level but results in only a 
modest increase in the response time and storage reguired 
over the next lower level. Through our testing we will 
attempt to determine if such a ‘best' level exists and if 
so, which of our levels it corresponds to. 

The second hypothesis for the system is that as’ the 
humber of partitions of the database abstract increases, the 
system will produce tighter upper and lower bounds but will 
suffer from longer response times and larger storage 
requirements for the database akstract. It is also believed 
that there exists an optimum granularity for the databse 
abstract in the sense that increasing the number of parti- 
tions above that granularity will result in only slightly 
tighter bounds but will result in constant increases in 
system response times and storage required for the database 
abstract. Through our testing we will attempt to determine 


what this optimum granularity is for a sample database and 
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what charactacteristics of that database determined this 
optimum granularity. 

The third and final hypothesis for this system concerns 
the method that the user uses to input his query set. Ton 
order to make the system aS general purpose as possible the 
user is allowed to enter his sets by specifying ranges of 
attributes. For example, using the sample medical database 
utilized throughout this paper, if the user wishes to know 
the size of the set of ail patient with temperatures between 
36 and 37 who also have a cholesterol level between 230 and 


250 he may specify the query set: 
temp (36,37) * chol (2307250). 


This method has the advantage cf being very general purpose 
but suffers the penalty of less accurate bounds generated by 
the system as will be seen in chapter three. If the user 
queries the actual partition sets of the database abstract, 
the bounds produced by the system will be much tighter, 
however, the user may have no need to know the size of such 
sets. As a solution to this conflict between tighter bounds 
versuS a more general query method, we submit that for many 
databases, there exist predefined ranges of attribute values 
which are queried often by the user community. If such 
ranges were forced to be actual partition sets during the 
generation of the database abstract, a significant improve- 
ment in the tightness of calculated bounds could be achieved 
when such ranges were queried by the user. Testing on such 
a specially constructed database abstract will be conducted 
to determine if the expected benefits are in fact as great 


as expected. 
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II. IMPLEMENTATION OF THE D 


A. GENERAL INFORMATION 


The previous Chapter intrcduced the operation on the 
database abstract equivalence classes as a database of rules 
which extract the queried information from the primitive 
information stored in the database abstract. This database 
of rules implies that the most natural method to implement 
such a system would be as a rule-based production systen 
where the system must decide which rules apply to the given 
guery and apply only those rules to produce the most accu- 
rate result. Such a prototype rule-based system was imple- 
mented as a part of this project again utilizing the medical 
database described in chapter one. This chapter will 
describe in some detail the structure and organization of 
this systen (to be called DAQUES, for Database Abstract 
QUery Estimation System for the remainder of this paper). 

The first major decision to be considered before 
commencing the construction of this system was the progran- 
ming language to utilize. After considering both Lisp and 
PROLOG, the advantage of faster speed of execution because 
of the Lisp compiler seemed tc be surpassed by the more 
Natural syntax and excellent pattern matching capabilities 
of PRCLOG. Because this system is only a prototype systen, 
speed of execution was not considered to be aS important as 
the ease of understanding the lcgic of the rules in the rule 
base provided by the PROLOG language. Additionally, this 
author also believes that the ower of logic programming is 
just now being recognized and such languages may very well 


become the principal ones for artificial intelligence 
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programming applications in the very near future. It seems 
advantageous for computer sScientis s outside of Japan to 
begin to learn the strengths and weaknesses of this 
language. 

The DAQUES system consists of two major parts, the first 
being the database abstract generation portion and the 
second the actual query system. These subsystems are almost 
completely independent of each cther, their only link being 
that the output of the generation subsystem is the input to 
the query subsystem and the file t‘attrlist'" which is 
consulted by both subsystems. The subsystems will be 


described in the following two sections. 


B. THE DATABASE ABSTRACT GENERATION SUBSYSTEM 


1. General Characteristics of the Subsystem 


— == SS ee a 


The database abstract generation subsystem is not 
intended to be used by the average database user with only 
query privileges. Use of this system will recuire a rela- 
tively sophisticated knowledge of the meaning of the tern, 
database abstract, and of the the merits of the various 
tradeoffs which must be made when constructing the database 
abstract. For this reason, it is recommended that the use 
of the database abstract generation subsystem of DAQUES 
reguire the user to hold the database administrator privi- 
lege or its eguivalent. 

The database abstract generation Subsystem 1S 
constructed to be as general as possible so that it can be 
applied to an arbitrary database. However, there area 
Minimal number of facts in the system which are specific to 
the example medical database. These facts are located at 
the end of the tattrlist' file and provide a list of the 
attibute names for the database, a designation of the attri- 


butes which are non-numeric and their values in the 
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database. These data items would be the only portion of the 
entire system which would have to be altered to port the 


system to another database. 


2- The User Interface 


The most visible porticn of the database abstract 
generation subsystem is the user interface section which 
prompts the database administrator for all required input 
information needed to construct’ the desired database 
abstract and converts that information into a form which the 
rest of the program can understand. The information needed 
is the name of the database to be processed, the level of 
knowledge for the particular database abstract to be 
constructed (could be tall' if the administrator desires to 
construct four leveis at once), the number of partitions 
that each attribute within the database is to. be divided 
into and the name of the output file or files in which to 
store the database abstract. The name of the database file 
May seem to be unnecessary infcrmation, after considering 
that several rules in the system are specific to a partic- 
ular database. With the current system it is true that the 
database name is unnecessary. However, it is possible to 
process one or more databases with no attribute names in 
common and therefore include all necessary rules for all of 
the databases. The rules would not interfere with each 
other because the clause heads cf the rules for the inappli- 
cable database would not match, given the distinct attribute 
hames. Therefore, to provide added flexibility, this infor- 
Mation is included as part of the input from the database 
administrator. 

The level of knowledge to be input must be one of 
the five discussed in chapter cne or the user may specify 
‘all' levels. The ‘allt option is a bit of a misnomer 


because it calculates only levels two through five. This 
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distinction is because of the fundamental difference between 
the calculations for level one and those for levels two 
through five to be discussed later in this chapter. The 
advantage to choosing the tall' option is that the database 
need only be processed once to froduce four different levels 
of knowledge whereas, if a particular level is specified, 
the database is processed once for each level. Therefore, 
the ‘all' option can provide a Significant savings in time 
and machine resources especially for a large database. EE 
the user specifies the ‘all‘* option, he must also specify a 
list of output files in which tc store the database abstract 
files for levels two through five instead of a single file 


name. 


3. The Partitioning Process 


The most difficult and important decision which the 
database administrator must make concerns the granularity of 
the partitions for each attribute. If the attribute is 
non_numeric, there is only one acceptable number of parti- 
tions and it is input from the database specific facts in 
the tattrlist' file. If the attribute is numeric, the user 
must input the desired number of partitions for that attri- 
bute. The attribute list file is then checked to see if the 
partition boundaries have previously been calculated for 
this granularity. If so, the tabulated boundaries are used. 
If not, the user is asked to choose between automatically 
generated partition boundaries or specific boundaries which 
he may input himself. 

The former method for determining the partition 
boundaries uses an algorithm tc attempt to place an equal 
humber of values in each partition. The input to this rule 
is the number of partitions, and a list of the values of 
that attribute together with their frequency of occurrence 


in the entire data ase. This list 1S equivalent to level 


ae 


five knowledge about the entire database and must be 
produced by the database administrator prior to utilizing 
the system. This information is easily generated by running 
this same system and specifying ‘one' as the number of 
partitions. The algorithm divides the number of values in 
the database by the granularity and then passes through the 
frequency list including each value until its inclusion will 
cause the number of values in the partition to exceed the 
calculated size of the partiticn. The actual size of the 
first partition is then subtracted from the size of the 
database. That number is divided by the number of parti- 
tions minus one and the algorithm is called recursively. 
When the database contains a large number of distinct items, 
none of which occur very frequently, (i.e. even distribu- 
tion), this algorithm is guite good for producing partition 
boundaries resulting in partiticns of relatively egual size. 
However, when an attribute hasS one or more values which 
occur many more times than the cther values in the database, 
fe. ce. the distribution is highly uneven), this algorithn 
results in partitions of very different sizes and conseg- 
uenotly, as will be evident in our discussion of the guery 
subsystem, reduced accuracy of guery results. 

In the case of the highly uneven distribution, or to 
create partition sets which are meaningful in a particular 
application field, the administrator may explicitly specify 
the boundaries of the partitions. This option is a highly 
desirable one especially in the latter circumstance. Many 
attributes fall naturally into fartitions because of conven- 
tions within the application area or merely for convenience. 
AS mentioned in chapter one during the introduction to the 
meaning of the database abstract partitions, the temperature 
attribute may be divided intc low temperatures, normal 
temperatures and high temperatures by specifying the numeric 


convention for defining these terms. In the context of our 
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program, such user specified partition boundaries may not 
result i: partitions of r itively equal size, however the 
partitions themselves wil. oe sets which are much more nean- 
ingful to the expected user of the system than the algoritn- 
ically produced partitions. For reasons that will be 
discussed in depth later in this chapter, there are also 
tremendous advantages in accuracy of the bounds calculations 
for the system user when his query set corresponds exactly 
to aounion or intersection of the exact partition sets 
rather than numerically specified boundaries. 

For example, if the user is interested in the set of 
male patients with temperatures bet een 35.5 and 36.5, he 
may explicitly guery this set with no knowledge of the 
partitions contained in the database abstract. If this set 
does not correspond exactly to the boundaries generated by 
the partitioning algorithno, the query preprocessor nust 
determine the minimal union cf actual partitions which 
covers this desired set and calculate the upper bound for 
the query based on this set. Ina Similar manner, to calcu- 
late a lower bound on the query, the auery preprocessor 
determines the maximal union of actual partition sets which 
is completely included in this desired set and calculates 
the lower bound of the query based on this set. If, on the 
other hand, the database administrator is aware of commonly 
queried sets in the applicaticn area such as the normal 
temperature range, he may explicitly specify that the parti- 
tion boundaries correspond to these ranges, providing much 
more accurate query results for such sets. 

To extend this concept cne step further, the data- 
base administrator may specify an alias for that partition 
hamne such as ‘'normtemp'. Using this facility allows the 
user to express his query in terms of the ‘normtemp! set 
instead of the conventional name for the second partition 


set for the temperature attrikute, 1.e. temp(2). The 
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advantages of explicitly specifying partition boundaries for 
particular attributes should become more apparent after 
reading the discussion of the query subsystem of DAQUES. 
Whichever method the database administrator chooses 
to arrive at the boundaries of the partitions, these bound- 
aries are inserted into an attribute list and added to the 
‘attrlist' file for use in later database abstract genera- 
tion executions and for use by the preprocessor of the guery 
subsysten. Additionally a list of the attribute names and 
the number of partitions is written into the level one data- 
base abstract file and is passed on to the ‘frequencies! 
file which performs the actual database abstract generation 


function. 


4. Generation of the Actual Database Abstracts 


First we will discuss the level one calculation. As 
Mentioned previously, there is an inherent difference 
between the method of calculaticn for the level one database 
abstract file and the remaining levels. Recall that level 
one information contains the size of each partition with 
respect to each attribute in the datapdase. When the user 
interface rules determine that level one knowledge is 
desired by the database administrator the ‘levell' file is 
consulted. Conseguently, when the 'ex' function is calied 
by the user interface, it matches the level one ‘ex! rule 
and begins the calculation cf the level one database 
aestract. The first task for this calculation is the 
initialization of the totals for the size of each partition 
located in the PROLOG internal database to zero. Then each 
tuple of the database is read in from the database file. 
The "check" function is called for each of the attributes in 
the list passed to this file Ey the user interface. The 
"check" rule determines the value of the given attribute in 
the current tuple. It determines the partition set into 
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which this value falls and increments the total for that 
partition set in the internal data!ase by one. After the 
entire database has been processed .a1 this manner, level one 
information is contained in the PROLOG internal database. 
The ‘continueall' clause accesses all of this information 
and writes it to the external file specified for the level 
one database abstract by the database administrator. 

When the database administrator specifies any level 
other than level one, the user interface rules consult the 
'freguencies' file as weli as the corresponding ‘level' 
fae. The ‘freguencies' file produces the same information 
for levels ‘wo through five and asserts that information 
into the PRLULOIG internal database for use by the appropriate 
‘levelt file. The information produced by the 'frequencies' 
file is exactly the level five database abstract information 
and from this information, levels two through four informa- 
tion can be derived. Recall that level five information is 
a list of the values that occur for each attribute in each 
of the partition sets together with the fErequency with which 
those values occurred in that set. Therefore the first task 
accomplished by the 'frequencies' file is to assert an empty 
jist as the initial tagged freguency distribution list for 
each attribute and each partition set in the database 
abstract. Then, each tuple is read from the database file. 
The ‘checkallattrs' rule 1s invoked for each attribute. 
This rule finds the value of the ‘partitioning attribute' 
from the current tuple and determines which partition set 
that tuple belongs to with respect to the ‘partitioning 
attribute’. Then the ‘check rule 1s invoked for each 
numeric attribute. This rule finds the value of the ‘check 
attribute' and accesses the freguency list for the partition 
set of the ‘partitioning attribute" with respect to the 
"check attribute’. The tintolist' rule searches the iist 


for the value of the ‘check attribute! and increments the 
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freguency corresponding to that value by one if the value is 
found. If the value is not found in the list, the value is 
inserted into the list with a frequency of one, maintaining 
the ascending order of values (not frequencies) within the 
mist . This process is repeated with each numeric attribute 
substituted as the ‘check attribute’. Then the 'checkal- 
lattrs' rule is called with the next attribute substituted 


as the ‘partitioning attribute". 


5. Processing an Example Tuple 


As an example of this frocess, recall the medical 
database with its six attributes; patient number (patno), 
sex, disease activity (da), temperature (temp), cholesterol 
level (chol), and prednisone level (pred). The values of 
these tuples are arranged in that order within the input 


tuples so that an input tuple would appear as follows: 
10, female, 12,:36.2, 230, 55) 


The "checkallattrs' rule would first be invoked with the 
patno attribute as the "partiticning attribute’. The value 
or 10 for patno would place this tuple inthe  patno(1) 
partition, for example. Then the check rule would be 
invoked with the patno attribute as the ‘check attribute! 


and the following list would be searched for the value 10: 
Gaea (treqdist, patno (1), patno,{[4,1],(7,2]j,[ 10,3 ],{ 15,6 ]}) 


In this case, 10 would be found and its frequency, 3, would 


be changed to 4. Then the check rule would be invoked with 
da (the next numeric attribute) as the ‘check attribute' and 
the following list would be searched for the disease 


activity value of 12: 


data (fregqdist, patno(1) ,da,[{[5,3],{ 14,4],[ 24%,5}]]) 
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In this case the value of 12 wculd not be found so the two 
element list [12,1] would be inserted into the list to 


produce the following new list: 
data(fregdist, patno(1),da,{[{5,3],{12,1],{ 14,4 ],[ 24,5] )) 


In a Similar manner the 'checkt rule would be invoked with 
the attributes temp, chol and pred as the ‘check attri- 
butes', Then with the patno attribute execution of the 
'checkallattrs' rule completed, this rule 1s invoked with 
the sex attribute as the ‘partitioning attribute’. The 
partition set 'female' is determined to be the one to which 
this tuple belongs with respect to the sex attribute and the 
"check' rule is invoked with each Nineric attribute as the 
‘check attribute'. In like manner, the ‘checkallattrs' ruie 
1s invoked with the da, temp, chol and pred attributes as 
the "partitioning attribute' te complete the processing of 
this one tuple. Then the next tuple 1s read in and the 
process repeated. 

When the entire database has been read and 
processed, the internal PROLCG database contains tagged 
frequency distribution lists fcr every partition and each 
attribute. A good way tu. visualize this info: ition is to 
recall that the partitions or eyuivalence ci ieS or our 
database abstract are the primitive data items .f the 'quo- 
tient database’. We are attempting to capture as much of 
the information content of the tuples of the original data- 
base as possible in our database abstract. We cannot tabu- 
late a value for each of the attributes for the partitions 
because the partitions are made of a collection of tuples of 
the original database. Instead we chose in level five to 
Substitute this tagged freguency list for each attribute as 
the ‘abstract value' of that attribute. In terms of our 
example, instead of tabulating a da value for the patno(1) 


partition we tabulate the list cf the da values from all the 
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tuples which fall into the jfatno({1) partition and the 
frequency of occurrence of each value in that partition. 

Now that all this information has been asserted into 
the PROLOG internal database, the ‘freguency' file invokes 
the 'continue' rule. The content of this rule will depend 
on which ‘level! file was consulted by the user interface 
after inputting the desired level of knowledge from _ the 
database administrator. If the desired level is five then 
the ‘continue rule merely writes all of the tagged 
freguency distribution lists to the external file designated 
by the database administrator to receive the level five 
database abstract. If level four is chosen, the ‘continue’ 
rule compacts the tagged frequency lists removing the values 
of the attributes and leaving only the frequencies. These 
frequencies are then sorted in descending order and written 
to the file specified for the level four database abstract. 
If level three is designated, the tagged frequency list is 
sorted in descending order of the frequencies and the second 
and middle elements of the list are extracted, corresponding 
to the frequencies of the second most frequent and the 
median frequency value of the check attribute on that parti- 
fron . These values are written in an appropriate format to 
the file designated for the level three database abstract. 
Finally, if level two is specified, the tagged frequency 
lists are again sorted in descending order of freguencies 
but the first frequency in each list and the length of the 
list is written to the appropriate file, representing the 
frequency of the most frequent value and the number of 
distinct items in the partition. 

It should now be obvious to the reader that’ the 
calculation of the database abstracts for levels two through 
five involves a common initial calculation. When the data- 
base administrator specifies ‘all' as the level of knowledge 


for the database abstract calculation, the talllevels' file 


IS, 


is consulted in place of a specific ‘level* file. The “come 
tinue' rule in this file invokes four individual frcules 
called ‘continue2', "continue3', ‘continued’, and ‘con- 
tinue5', all identical to the ‘continue' rules contained in 
their respective 'level't files. In this way the ‘continue! 
rule for the talllevels' file processes the tagged frequency 
distribution lists lccated in the PROLOG internal database 
producing all four desired datakase abstract files. 

Now that the database abstract generation subsysten 
of the DAQUES system has been described in detail, we will 
continue our discussion in the following section with a 
description of how these database abstract files are used by 
the guery subsystem to produce bounds and estimates for the 


user's queries. 


C. THE QUERY SUBSYSTEM. 


1. Structure of th Dsysten 


After the database administrator has completed the 
construction of the database abstract files, the second 
subsystem of the DAQUES system should be ready to be used 
with no modifications. This suksystem was designed to input 
all database specific informaticn, such as rules concerning 
attribute names, frem the database abstract files and the 
Yattplist® file. Therefore, the query subsystem should be 
able to guery multiple database abstracts independent of 
their particular partitions or even the database from which 
they were constructed, provided that the database abstract 
waS built entirely with the generation subsystem of the 
DAQUES systen. Of course, this system is designed for use 
by the standard user with only guery privileges. 

The guery subsystem has three major program parts, 
the user interface and help feature, the guery preprocessor 


and the actual calculation portion of the progran. With 
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regard to file organization, the rules are divided into five 
files. The 'master' file contains the user interface and 
all preprocessing functions as well as many rules’ too 
general to be included in one of the specific files (e.g. 
the ‘member! rule). The tinfot file contains the rules for 
the help feature of the user interface. It is currently a 
rather limited feature but was written with future expansion 
mien 1nd. The 'sups! file contains all the rules for calcu- 
lating the upper bound on the size of a query. It has its 
rules divided into those for unions and those for intersec- 
tions and then subdivided into rules for levels one through 
five within the union and intersection divisions. The 
"ainfs' file contains the rules for calculating the lower 
bound on the size of the queried set. Its organization is 
identical to the 'sups' file. Finally, the ‘ests! file has 
all the rules for calculating an estimate for the size of 
the input set. It has only distinct sections for union and 
intersection rules. 

Within the ‘sups', ‘infs', and ‘tests! files the 
basic prcegram structure is quite Similar. The ‘dosup', 
'doinf', and ‘doest' rules each input a statistic, a set, 
and an attribute. The statistic will always be ‘size' in 
the current system implementaticn, however, the system is 
designed so that additional rules might be added at a later 
time to extend the capabilities of the basic systen. Also 
for this reason the attribute is input to these rules to 
allow future queries such as the mean value of temperature 
for the input intersection or union set. With this exten- 
sible format, a general purpose user interface and prepro- 
cessor could invoke the appropriate ‘do' rule merely by 
passing the proper arguments to the rule. 

When the query sets are input to the ‘dot rules, 
they arein prefix form or in PROLOG terminology, they 


consist of a single argument functor, where the functor name 
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must be elther "and 'eer ors The single argument to these 
functors contains a list of the sets in the intersection or 
union and aS such can be of arbitrary length. Additionally 
any individual element of the list can be one of these 
Single argument functors indicating a lower level union or 
intersection. Only time constraints limit the size of the 
intersection or union list or the nesting level of unions 
and intersections within unions or intersections. Because 
of this PROLOG list structure embedded within a functor, the 
system becomes very general purpose and handles a very wide 
assortment of input query sets. Also such guestions as the 
optimum normal form of an input query become much iess 
important than they are in systems with a much more struc- 
tured input set. In [Ref. 1], for example, the query systen 
inputs sets 1n prefix form with two arguments only. 
Therefore, in order to query a set such as 
and (lowtemp,lowchol,male), that system had to decide which 
of the three possible formats, and(and(lowtemp, lowchol), 
male), and(lowtemp, and(lowchol, male)) or and(and(lowteap, 
male), lowchol), produced the best query results. (In his 
system, Rowe decided to process all three query forms and 
use the most accurate results.) Such guestions do not apply 
in a list oriented input format systen. 

The list format for the input sets does reguire a 
very different processing strategy, especialiy when the next 
set in the input list might very well be a very iarge union 
or intersection of other sets. Tests such as disjoint and 
subset rules become much more complicated because they must 
be more general purpose than systems with a fixed number of 
arguments. The general processing strategy for this format 
involves examining each set in the argument list. The set 
is checked to see if it is a ‘'simple' set in the sense that 
it has size information stored about it in the PROLOG 


internai database or in the system cache. If it does, the 


ae 


program retrieves this information and moves on to the next 
set. If the current set has nc data either in the internal 
database or in the system cache, then the set iS not ‘'sin- 
ple’ and the appropriate ‘do' rule is invoked for this set. 
When this ‘dot rule returns from execution, the set will 
have keen converted toa 'simple' set in that it will have 
all the same information stored about it in the system cache 
that an actual partition set has stored inthe internal 
database at the time program execution started. Another 
important point is that if this same union or intersection 
occurs at another point in the same guery or in a subsequent 
query within the same session, the union or intersection set 
Still aprears as a ‘simple' set to the system. A more thor- 
ough discussion of the details of this concept will be 
presented following the presentation of the user interface 


and preprocessing sections of the yuery subsysten. 


2. he User Interface and Help Features 


Because this system is intended to be used by the 
novice user, it includes a very friendly user interface and 
an expandable help facility. The user interface must input 
several items of information from the user and using this 
information, must control the ccnsulting of the proper files 
to ensure the results desired by the user. The user inter- 
face first asks the user what level of knowledge he wishes 
to use for his query and gives the user the choice ofa 
normal, alternate, or special database abstract file. (It 
is intended that database abstracts with several different 
granularities of partitions or aliases will be available to 
the user depending on his accuracy needs or the partition 
conventions which he commonly uses.) If the user does not 
understand the different levels of knowledge or does not 
remember which database abstract files are currently associ- 


ated with the terms normal, alternate and special, he may 
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simply respond by typing ‘thelr’. This will produce an 
explanation of the current database abstracts available as 
well as a discussion of the differences in levels one though 
five knowledge. After the user has entered this informa- 
tion, the user interface must consult the appropriate data- 
base abstract files as well as the alias file and attribute 
list files discussed in the Section on the database abstract 
generation system. The user interface then stores knowledge 
of the current level that the user is utilizing so that 
during the next guery, no additional files need be consulted 
unless the user wants to use a new level of knowledge or a 
new database abstract file. 

The user interface next inputs the statistic to be 
queried which currently must be the ‘size' statistic. ete 
the user inputs a different statistic, the help feature will 
provide an explanation that only the ‘'size' statistic is 
currently operational on this systen. It 1S anticipated 
that when other statistics such aS mean are provided in the 
system, the user interface will next ask the attribute that 
the statistic 1S to pe queried over. Because the size of a 
set is not associated with any particular attribute, the 
system currently omits this question. 

The user interface next inputs the set to be queried 
if in fixe fort. Again failure to input a proper set will 
resuit in acall to the help facility and an explanation of 
the proper format for inputting sets. It was decided to 
input the query set in infix form after several weeks of 
testing the program using the prefix form as input for the 
sets. The prefix, list oriented Set notation is quite 
useful for providing generality to the program, however, 
requiring the user to happen ufon the proper combination of 
parentheses and square brackets needed to input the desired 
set completely counteracted all other attempts made to make 


this interface user friendly. As an example, consider the 
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query wishing to know the size of the set of all males in 
either the first or second cholesterol partitions. The 


prefix form of this sets is the following: 
and ({male,or ({chol (1),chol(2) ]) }) 


Compare the above notation with the infix notation shown 


below: 
male * (chol(1) + chol(2) ) 


It should be obvious how this advantage in simplicity 
becomes increasingly more important aS query sets become 
more complicated. Therefore it was decided that the small 
degradation in system performance needed to implement an 
infix to prefix preprocessor was justified. This prepro- 


cessing feature is described in the following subsection. 
3. The Preprocessor. 


One need fora set format preprocessor was intro- 
duced in the previous section. This subsection will demon- 
strate two additional needs for preprocessing and describe 
the specific performance of the rules to implement these 
functions. 

During the first subsection above the general nature 
of the input set was discussed in some detail. One of the 
disadvantages of this format is the capability it provides 
to the user to input sets which execute much less’ effi- 
ciently than a different expression of the same set would 


execute. For example, the set: 
chol{1) + chol{(2Z) + chol (3) 
translates into the prefix forn: 


or({chol (1),chol (2),chol (3) }) 


oD 


which executes much more efficiently and accurately than the 


query: 
(chol (1) + chol(Z)) + chol(3) 
which translates into prefix as follows: 
or ([or ([ chol (1) ,chol(2) J) ,chol (3) }) 


The two sets are logically identical and should result in 
the same upper and lower bounds and estimates from the 
system. However, they would not without the second function 
of the preprocessor. This function is descriptively called 
the 'sguasht function because of its capability to transform 
a multilevel set into a logically eyuivalent single level 
set which will produce guicker and more accurate results. 
(For a thorough discussion of the ‘optimal form' of a guery, 
refer to [Ref. 2] ). This function does not just provide 
protection against the novice user who inserts extra paren- 
theses where they are not needed. The output of the third 
feature of the preprocessor, described below, is often ina 
form which needs to be 'sguashed' to execute efficiently and 
accurately. 

During the discussion of the alias = ‘re of the 
generation subsystem, it was mentioned that, in e2nerai, the 
input sets to the system would be in the form of unions or 
intersections of sets defined by a range of values of a 
particular attribute. So far we have only discussed sets 
which are exactly one partition set of a particular attri- 
bute. It was also mentioned that such exact partition sets, 
especially when they are aliased to a descriptive name, are 
quite useful when a particular attribute was partitioned 
during the database abstract generation, to correspond 
ex :tly to a set which is of some importance in the applica- 
tic field and therefore “ould commonly be a part o various 


que: ies. However, wher the akove conditions do nut apply, 
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it iS appropriate to allow the user to express the guery set 
in terms of a range of values cver a particular attribute. 
The query subsystem allows the user to express his query set 
in either of the formats as well as in terms of some alias 
name for a particular partition set. The third feature of 
the preprocessor provides the ccnversion of these forms to a 
common form to pass on to the calculation section of this 
subsysten. 

To provide a better understanding of the three 
stages of the preprocessor, we will follow the execution of 
a particular query set from its entry into the system until 
it is passed to the calculation systen for actual 


processing. Consider the following query: 
Male * chol (1715, 280) 


which means that the user wishes to know the size of the set 
of all male patients with cholesterol levels from 115 to 280 
inclusive. The user interface fasses this infix form to the 
‘convert! rule which matches the '*' symbol producing an 
‘and' functor with arguments of whatever is returned fron 
the recursive calls to ‘convert! made for the left and right 
Sides of the '** symbol. Had either of these sets been a 
lower level (parenthesized) unicn or intersection, then one 
of the elements of the functor's argument list would have 
been another functor. In fact, for infix expression without 
parentheses that are longer than two elements, the ‘'convert' 
rule uses left precedence rules and operates as if there 
were parentheses around the righthand side of the expression 
producing an output prefix expression of depth one less than 
humber of terms in the the input expression. This limita- 
tion is solved by the 'sgquash' function. 

When the 'convert' rule is invoked for a simple set 
like ‘'male', it recognizes this set aS a Simple set and 


merely returns the set name. When it is invoked with an 
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alias name, it checks the alias list and returns the name of 
the partition set to which the alias refers. If the rule is 
invoked with a range expression such as chol(115,280), it 
recognizes the two arguments to chol as lower and upper 
limits to a range and calls the ‘covert and ‘iscovered' 
rules. The ‘covert rule uses the upper and lower limits and 
the partition boundaries stored in the ‘tattributelist' facts 
to find the minimal union of actual partition sets for the 
given attribute which covers the desired range. It returns 
this union as the set to be used by the calculation system 
when the upper bound or sup of the statistic is being deter- 
mined. Then the *iscovered' rtle uses the same information 
to find the maximal union of actual partition sets for that 
attribute which is covered by the given range. This union 
is used by the ‘infs' file when calculating the lower bound 
on the size of the set. Suppose the ‘covert and ‘tiscovered' 
rules return or([chol(1), chol(2) ]) and or([chol(t) ]) as the 
two unions discussed above. Then the two sets to be passed 
on to the ‘sguash' function for the next stage of prepro- 
cessing would be and ({male, or({[chol(1), chol(2)]J) J) and 
and({[male, or([chol (1) }) )). When ‘squash' is called with 
the first set descriftion it returns the same form because 
there is no 'sgquishing' possible on this set. Consider, 
how ,er, 1f the input set had been a union instead of an 
intersection. Then the input cf the squash function would 
be or({male, or ({chol(1), chol(2) }) }) which would be 
‘squashed* into the more accurate and efficient form 
or ({ male, chol(i), <chol(2) 4). For the second set the 
*‘squash* function recognizes the single element union which 
results from the generality of the ‘iscovered' rule. This 
set will therefore be 'squashed' into and([male, chol(1) ])), 
a form which will provide much more accurate and timely 
results. 
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Now that we have described how the user's guery is 
preprocessed from its various pcessible input formats into a 
common form, in the following subsections we will describe 
how the actual calculation rules are implemented to produce 


the query results. 


4. Testing for Disjoint Sets and Subsets 


The detection of sets in the input union or inter- 
section which are disjoint or which are subsets of each 
other can greatly increase the accuracy of the query 
results. For this reason, the calculation rules contain 
complex tests for these conditions. The disjoint test 
primarily involves a search for setS which represent two 
different partitions of the same attribute which, by defini- 
tion, are disjoint. In addition, the disjoint test involves 
a complex series of tests to determine whether a union or 
intersection of sets is disjoint from another union, inter- 
section or simple set. The rules which are implemented in 
this section are discussed in detail in [{Ref. 1]. 

The advantage of determining that an intersection 
contains two or more setsS which are disjoint is that all 
further calculations may be skipped because the size of that 
intersection is most certainly zero. However, the advantage 
of determining that two or more sets in a union are disjoint 
is not so straightforward. If all the sets in the union 
were found to be disjoint from all others in the union, then 
we could arrive at the size of the set by merely adding ali 
the sizes of the comfonent sets. However if only two sets 
in a very large union list are determined to be disjoint, 
what advantage have we gained? We could consider this 
disjoint union as a simple set with a definite size eyual to 
the sum of the sizes of its twe components. There are two 
problems with this approach. The first is the guestion of 


what to do about the nontransitivity of disjointness. For 
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example, if set A is disjoint from set Band set B is 
disjoint from set C, then we cannot conclude that set Ais 
disjoint from set C. Therefore how do we decide which of 
the two pairs of disjoint sets to combine into a single set? 
The second problem relates to how to convert this disjoint 
union into a true Simple set in the sense that it has all 
the information tabulated about it in the cache that an 
actual partition set would have tabulated about it in the 
PROLOG internal database when program execution started. If 
the desired level of knowledge is one, there 1s no problem 


beca :se the size of the set is level one knowledce. 


However, if the desired level is five, the leve ‘ive infor- 
Mation about this disjoint union would have to calculated 
and added to the cache. Because of these dirfiicult ques- 


tions, it was decided not to test for disjointness during 
the union calculations. The level one upper bound on the 
Size of a union assumes the worst case, that the component 
sets are disjoint, in arriving at the upper bound. So even 
for this single level where the majority of the problems 
discussed above do not apply, the disjoint test would not 
demonstrate any possible advantage. 

it would be an interesting extension to this s’sten 
to include a disjointness test in the union caiculation, to 
develop rules to choose between cco.ftlicting pairs of 
disjoint unions and to write special rules for calculating 
the level two through five information about the disjoint 
union given only the level two through five information 
about the component’ sets. This extension to the basic 
system has great potential for improving system accuracy 
because of the prevalence of disjoint unions as input sets 
resulting from the preprocessor translation from range 
values into unions of partiticn sets as discussed in the 


previous sec ion. 
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The subset test cannct generally arrive at a 
specific answer for the size of an input set as the disjoint 
test can. It can, however, Shorten the union or intersec- 
tion list for which the bounds and estimate calculations 
must be made. The 'testallu' and 'testalli' rules perforn 
this subset test for the union and intersection rules 
respectively. The 'testallu' rule inputs the union list and 
determines whether any set in the list 1s a supset of any 
opner. If so, the subset is removed from the output list 
because such a subset could not possibly contribute any 
tuples to the union which have not already been contributed 
by the set which contains it. In a Similar manner for 
intersections , the ‘testalli' rule discards the supersets 
because they cannot possibly contribute any tuples. other 
than those contributed by the’ subset. The specific subset 
rules implemented in this system are discussed in some 
detail in [Ref. 1]. 

The disjoint and subset rules might be more appro- 
priately included in the preprocessor section of the query 
subsystem because they do actually alter the input to the 
actual calculation portion of the system in the case of the 
subset rules or abort the calculation in the case of the 
disjoint rules. It was decided to have these rules as part 
of the calculation section instead because they deal with 
the content of the sets in the input unions and intersec- 
tions. The preprocessor 1S mestly concerned with arriving 
at an optimum format with which to enter the input set and 


not so much with the meaning of the sets involved. 


De. IMPLEMENTATION OF RULES FOR THE VARIOUS LEVELS 


The actual calculation portion of the yuery subsysten 
receives three pieces of information from the preprocessor. 


These are the statistic being queried, the set being queried 
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in prefix notation, and the level of knowledge desired by 
the user. This information is contained in the three of the 
arguments of the 'dosup', ‘doinf', or ‘doest' rules called 
by the preprocessor. Depending on the value of these argu- 
ments any of several different rules could be invoked to 
calculate the appropriate bounds and estimates. The current 
system contains rules for answering queries on the size 
statistic only. (Chapter IV contains a discussion on how to 
extend the system to cther statistics.) The prefix form of 
the set will have either an ‘tand' functor or an ‘or' functor 
and will match a different set of rules in the ‘'sups', 
‘infs' and ‘tests' files depending on the value of this 
functor in the set notation. Finally, the level of knowl- 
edge specified by the user will cause only the appropriate 
calculations for the desired level to be executed. 

As an example of this rule matching process, we will 
continue the example that was started in the previous 
section. This section concluded that the preprocessor had 
converted the ie elaee input set into 
and ({ male,or([chol(1),chol(2) }]) () for the upper bound calcu- 
lation and and({male,chol(1) }) for the lower bound calcula- 
TVON. Suppose also that the user had desired that level 
five be used to calculate the size of this set. In this 
case, the preprocessor would call the ‘sup' portion of the 


query subsystem with the goal: 
dosup(size,and({ male,or ({ chol{1){ehol(2)))} )),_,suec, ove 
and the ‘inf' portion of the sutsystem with the goal: 
doinf(size,and({male,chol(1) }]),_,Inf,5). 


The PROLOG system attempts to match these goals with the 
Clause heads of all the rules in the 'sups*' and ‘infs' files 
respectively until a rule with an appropriate clause head is 


FOUnG. The only rule in each of these files which will 
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match will be the appropriate general upper and lower bound 
rules for calculating bounds on intersection sets thus 
further restricting the possible rules which could be 
applied. Finally, the general 'dosup' and ‘'doinf' rules 
call each of the ‘'dosup?i' through ‘dosup5' rules~= and 
‘doinf1' through 'doinf5' rules passing to these rules the 
Set to be queried and the level of knowledge desired by the 
user. 

The level one calculations are unigue from the remaining 
levels in that they are always executed regardless of the 
specified level. Level one calculations are always executed 
because they are a part of the fundamental structure of the 
query subsysten. The level one ‘dosup' rule calls the 
"supall' clause passing to it the list of the component sets 
in the union or intersection. The 'supall' clause tests 
each component set to determine if that set has data stored 
about itS size in the internal PROLOG database or in the 
system cache. If the set happens to be an actual partition 
set of the database abstract as 1s the case with the set 
"‘male' above, the level one database abstract will have the 
actual size of the set recorded in the internal database. 
This size will merely be returned as an upper’ bound on the 
size of this component set. slits however, the set has no 
information stored about it in the internal database, as in 
the case of or({chol (1),chol (2) }} in the example above, the 
"‘Supall' rule recursively calls the 'dosup' rule with this 
component set as the argument. 

The purpose of this call to the 'dosup' rule is not only 
to return an upper bound on the size of the component set, 
but to convert that set into a true ‘simple set' with 
respect to level five information. The term ‘Simple set' 
means that the set should have all information stored about 
it in the system cache that an actual partition set would 
have stored about it when system execution begins. In the 
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case of our example, the set or([{[chol(1),chol(2) ]) would 
have to have an upper bound on its size and an upper bound 
on its level five tagged frequency distribution list 
asserted into the system cache to fulfill these reguire- 
ments. The lower level call to the ‘dosup' rule executes 
exactly the same as the upper level call to that rule except 
that when the supall clause tests each of the component sets 
of this union, they are already simple sets. Therefore, the 
list of the tabulated sizes is returned and the level one 
execution continues. 

For each of the remaining level , a rule exists which 
says, for example, that the corresponding dosupx goal 
succeeds if the des. 2d level is not egual to x. In this 
case only the simple test for level five fails and back- 
tracking must occur to satisfy the ‘dosup5" goal. The 
second ‘dosup5' rule contains the calculation needed to 
obtain the appropriate upper bound. The two ‘dosup5* rules 
acre shown in figure 2.1. In these rules L is the intersec- 
tion set list and I is the level of knowledge requested by 
the user. The goal established by the ‘dosup' rule is the 


following: 
dosup (size,and({ male,or ({ chol(1),chol(2) ]}),_,Val,5). 


The PROLOG system will m ch this goal to the first rule in 
figure 2.1 and instantia-e the variable L to the set list 
and the variable I to 5. The first subgoal which this rule 
will then attempt to prove will be not(5=5) which of course 
will fail, causing the system to backtrack to the next 
‘dosup5' rule. This rule instantiates the same two vari- 
ables but the first subgoal, 5=5, succeeds this time and the 
second subgoal is then established. The 
‘data (attrilLutes,Alist)' subgoal retrieves a list of the 
attribute names for the given database instantiated to the 


variable Alist. This list is provided in the ‘tattrlist' 
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file which contains all the database specific information 
needed by the system and which was consulted at the begin- 
hing of execution. The next sukgoal set up by the system is 
the ‘map’ subgoal which causes the ‘sumfregs' rule _ to be 
executed with each element of the Alist as its first argu- 
ment. The rules for the 'map* functional and for the ‘'sua- 
fregs* clause and its subordinate clauses are shown in 
figure 2.1. 

The ‘sumfreg' goal is first set up with patno as the 
value of the Attr variable and [male,or({[ chol(1),chol (2) }) ] 
as the value of the variable lL. The *sumfregs' goal first 
establishes the subgoal ‘getfreqs5' which returns a list of 
the two level five frequency distribution lists for the 
patno attribute for the two sets in the list. Note that the 
success of this clause is only possible because the lower 
level call to the ‘dosup' rule made in the level one calcu- 
mteron With or ([_chcl(1),chol(2) }) as the argument set 
converted this set to a simple set with respect to level 
five. In other words this lower level call asserted an 
upper bound ona level five tagged frequency distribution 
list for this union for all the database attributes. 

Next, the system passes this list of freyuency distribu- 
tions cn to the ‘'findsum5' rule. This rule is the heart of 
the level five calculation. It is a recursive rule whose 
basis is the condition that the first frequency distribution 
in the list of frequency distributions has been reduced to 
the empty list. On each iteration, the ‘findsum5' rule 
considers the first item in the first frequency distribution 
list and calls the 'minfregofitem' rule to return the 
Minimum frequency with which this item occurs in any of the 
frequency distributicns of the list. This miniaum frequency 
is then added onto a running sum and it is also inserted 
into a new list which will eventually represent the upper 


bound on the tagged frequency distribution of the input 
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intersection set. Then the ‘findsum5' rule is called recur- 
Sively with the remainder of the first distribution substi- 
tuted for the original distribution. When the basis condi- 
tion for this recursion is satisfied, the running sum will 
represent an upper bound on the size of the intersection 
because only items which occur in the first set of the 
intersection can occur in the intersection set. Those that 
do occur in the first set can cccur in the intersection set 
with a frequency no greater than the frequency with which 
they occur in that component set in which they occur the 
least frequently. Therefore the sum of theSe mininun 
frequencies represents an upper bound on the size of the 
intersection set. Note also that the new list representing 
the upper bound on the level five distribution of the inter- 
section 1S asserted into the system cache to convert this 
intersection into a Simple set with respect to level five. 
In the case of this query this conversion is not’ really 
necessary because the intersection set is the higherst level 
set for this query. However, the conversion is completed to 
Maitain the generality of the ‘dosup! rules. (An alternate 
approach to this problem is presented in chapter IV.) 

The concept of a union or intersection being converted 
into a Simple set is fundamental to the execution of this 
system. When the upper level "supall' rule returns from its 
execution, not only will it return a list of upper bounds on 
the sizes of the component sets in the input set list, but 
it must have caused the additional side effect of tabulating 
the appropriate level informaticn for each of the component 
Sets into the system cache. In this way, when the 'dosup5'! 
rule is called with the input set list, the system is 
assured that level five information is in fact tabulated for 
every set in the input list. 

The above discussion applies regardless of the level of 


knowledge specified by the user. The level x dosupx rule 
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dosup5(size,and (L Vadis s= not (1=5).. 
dosups (size, and (L}4~,Valy i) [= lero 
data(attributes, ast) ,... 
map(sumfregs,Alist, L ,Vlist), 
minl(Vlist,Val). | 
ma ’ Pealns ; 
nab (Fénctot,( Arg! | arglist J, 0therargs,|Valiothers }) | 
makelist(Functor,Arg1,O0therargs,Val,Function), 
Loe F UnGtLonm,ca i 1 (xX) 
. map (Functor, Arglist, Ot herargs,0thers) . 
makelist (F,A1,0t al, Resultlist) :- 


herargs : 
append ([ Fif{ 1fothérargs )],{ Val}],Resultlist) . 


sumfregs(Attr Bauer = (ERE ee obet te RECL) « 
findsun5(F1,Va calle 
asserta (cache(freg SG) ,dana(h) ,Attr,AL,sup)). 


etfregs5 ‘ 

jetiredss ( First fest, ee eer = 
Pa eoELed ‘st, Firs rete eva tt le), 
getfreqs5 (Rest, £,F1,AEtr). 


fare sune tt Others ee a 
findsumd5 t m,Fregqj|Rest ]] Others],S, ee ae J Ae) :- 
minfreqofitem(Im, Freq, Min Ot ers), 
findsum5({RestjOthers},0ld,Al), 
S is Olid Min. 


minfregofitem (Item,Sofar,Sofar,, }). 
minfregofitem (Item,0,0,_).. 
minfregofitem (Item,Sofar,Min,[ First] Rest]) :- 
qmember (1 
min (Sofar,New, Sofar2) 


| Min regofitem(Item,Sofar2,Min,Rest) . 


Figure 2.1 PROLOG Code for the Level Five Upper 
Bound Calculation for Intersection Sets. 


temn,First,New), 
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will always return an upper bound for the input set and will 
always cause the side effect of asserting into the systen 
cache the level one and the level x information about the 
input set thus converting that set into a simple set with 
respect to level x. An analogous construction exists for 
the doinf rules. The general ‘doinf' rule calls ‘doinfi' 
through ‘doinf5" with only the level one and the specified 
level calculations being executed. The level one ‘doinf' 
rule calls the ‘tinfall' rule which returns a lower bound on 


the size of all the component sets and causes the side 
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effect of tabulating ower bounds on the level 1 and thespe- 
cified level inform. ion about the input set. 

We have thus far avoided discussing the calculation of 
the estimate for the size of the input set. Recall that the 
Minimal cover set used to calculate the upper bound on the 
example set was different from the maximal union covered by 
the queried range used to calculate the lower bound. i 
order to obtain a truly meaningful estimate as to the size 
of this set, we must calculate an estimate for both of the 
above sets and average the results. For this reason, the 
‘doest' rule is called twice, once with each of the two 
forms of the input set used as its argument. The execution 
of the ‘'doest' rule is guite similar to the 'dosup" and 
‘doinf' rules except that only level one estimates are 
calculated. The ‘'doest*t rule calls the testall' rule for 
the input set list which returns an estimate as to the size 
of each of the sets in the list. The *testall* rule either 
returns the tabulated size of each component set or calls 
the ‘doest' rule recursively for sets which have no tabu- 
lated size estimates available. Of course, when each of 
these lower level ‘doest' calls returns, the argument set 
has an estimate as to its size recorded in the system cache. 
Because the estimate rules were the same for all ievels of 
knowledge, these calculations did not produce vey inter- 
esting results. For that reason, the estimates are not 
discussed at all in chapter III. 

So far we have only discussed with any meaSure of detail 
the implementation of the rules for level five upper bound 
calculations. The general structure of the calculation for 
the remaining bounds calculations is similar, however, there 
are some Significant differences. First of all, there are 
ho cules to calculate new ufper bounds for levels’ two 
thr.-agh four ur °-n sets and new lower bounds for levels two 


through four in —~rsection sets. Therefore, the ‘dot rules 
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for these levels merely use the already calculated level one 
bounds and cause the side effect of converting the input set 
to a Simple set with respect to the given level. Regardless 
of whether they calculate a new bound on the size of the 
input set all of the level calculations above level one 
involve a call to the 'map' functional to execute a given 
calculation for each attribute in the attibutelist. The 
calculations are also quite similar. Each one involves a 
retrieval rule which makes a list or lists of the informa- 
tion in the database abstract corresponding to the input 
sets. Then a recurSive function passes through this infor- 
Mation list building a bound value for the size of the set 
and a bound for the appropriate level information corre- 
sponding to the set. (If the calculation corresponds to one 
of the levels where no new bounds are possible, this recur- 
Sive function calculates only a bound on the appropriate 
level information.) This level information is then asserted 
into the system cache and the calculation is repeated with 
the next attribute. When the map functioanl returns its 
list of answers, the sup rules choose the Minimum from the 
list as the least upper bound while the inf rules choose the 
maximum from the list as the greatest lower bound. BOGed 
thorough discussion of the equations used in these rules, 
refer to [Ref. 2]. 


E. SUMMARY 


This chapter has discussed in detail the implementation 
of the Database Abstract Query Estimation System, including 
its two subsystems, the database abstract generation 
subsystem and the query estimation subsystem. The following 
Chapter will present the results of various test cases 
executed on this prototype system and discuss the concept of 


developing the optimum database abstract. 
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III. ANALYSIS OF SYSTEM TESTS 


A. TESTING STRATEGY 


System testing of the DAQUES system was not conducted in 
a completely rigorous manner, hcwever, sufficient test cases 
were chosen to demonstrate the strengths and weaknesses of 
the system and to show how the system can be customized to 
Suit the requirements of different users. Three different 
factors were varied during the tests to demonstrate several 
characteristics of the system. These factors were the level 
of knowledge of the database arstract, the granularity of 
the database abstract, and the use of partition aliases to 
construct the database abstract. The ten gueries listed in 
table I were run on each of the test database abstracts to 
determine the effects of varying the above factors. 
Inspection of table I will show that the queries used repre- 
sent a good variety of different sets that the user may 
query. Of the ten sets in the table, five are intersections 
and five are unions. Some of the setS include a large 
percentage of the total database while others represent only 
very small sets. A union of three setS was chosen as one of 
the ten sets to demonstrate the system's ability to handie 
larger unions or intersections. Each of the database attri- 
butes was used several times in the ten queries to provide a 
more even mix of test cases. 

Three characteristics of the system execution were 
compared during the tests inorder to arrive at useful 
conclusions. The first of these characteristics was tight- 
hess of the upper and lower bounds calculated by the system. 
This measurement apperared to be the most intuitive and 


eaSily observed indication of how accurate the system 
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TABLE I 


Ten Queries Used to Test the Systen 


Query Number Set Actual Size 
1 daGizeezoy es temp (326.9,37. 3) 290 
2 chol (115,277) * pred (3,20) 340 
3 atno (17 47) + da (17 47) + pred(17,47) 385 
4 poe 7.8,4 = 3) male 8 
5 female * chol(129 ae 244 
6 patno (3 oe + da (30 g } 327 
7 patno(25,27) * da (13, 18) iv 
8 chol(115,230) + pred(22 enh 435 
9 temp (36.8,37) * Batno (46,4. ) 13 
10 -pred(12,1/7) + da(19,2 101 
NOTE (1) The first ety can be faraphrased as the set of 
all patients who either have a disease activity 


between 12 and 26 inclusive or have a temperature 
between 36.9 and 37.3 inclusive. The second quien 
refers to the set ofall patients who have both a 
cholesterol level between115 and 277 inclusive and 
rednisone level between3 and 20 inclusive. 
NOTE (2) Query number 4 is also referred to as: 
hitemp * male . 
and Query number 6 1s referred to as: 
lowpat + hida 
when using the aliased database abstract. 


responses were. The second characteristic measured was the 
response time of the system for the given input query. This 
measurement used the FROLOG ‘cputirme' function which in turn 
uses the Unix operating system time utility. Therefore the 
response time measurement may vary pluS or minus’ several 
seconds depending on the user lcad on the system. The test 
cases were all run under lightly loaded conditions to mini- 
mize the adverse effects of these discrepancies. Finally, 
the third characteristic measured as output was the storage 
reguirements to hold the database abstract file. The PROLOG 
‘consult! function returns the size of the consulted file 
following a successful consult operation. This logical file 
size was used because it represents the the storage required 
by the database abstract in the internal PROLOG database. 

Of course, a better system is determined by a tighter 
bound on accuracy, a shorter speed of execution, anda 
smaller storage requirement for the database abstract. The 


following three secticns of this chapter will discuss each 
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of the variables in turn and describe how varying the: 


affected the system characteristics mentioned above. 


B. LEVEL OF KNOWLEDGE OF THE DATABASE ABSTRACT 


The ten queries of table I were used to test this factor 
by running each query five times, once for each ievel of 
knowledge. Figure 3.1 shows the graph of average system 
accuracy and speed of execution over the ten queries tested. 
This graph was prepared using a database abstract with all 
of the attributes partitioned into four divisions, however, 
the general shape of the accuracy curve is identical for all 
other granularities tested. This curve does show only the 
average accuracy and response time results of ten queries, 
however the reader can gain an appreciation for the general 
tendencies of the system accuracy and response time as the 
level of knowledge is varied even if no appreciation is 
gained for the sizable variance in responseS among the 
gueries tested. 

This graph shows the manner in which the accuracy of the 
system response as well as the speed of system execution 
increases as the higher levels of knowledge are used for the 
queries. (Note that an increase in accuracy of query 
respoi.se translates into a downward slope on the graph.) 
System accuracy increases ata relatively constant and 
rather slow rate for the first four levels, however the 
increase in accuracy takes a relatively sharp curve upward 
between levels four and five. One reason for this greater 
increase in accuracy is the fact that the queries tested 
included five unions and five intersection sets. Recall 
that the system did not have any rules for calculating new 
upper bounds for set unions for levels two, three, and four 
hor did the system have rules for calculating new lower 


bounds fo. set intersections for the same three levels. 
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These levels for which news bounds could not be calculated 
merely used the level one bounds for that query. Therefore, 
It was not until level five was chosen that any given query 
could show an improvement over level one in both the upper 


and lower bounds simultaneously. 


Also contributing to the relatively sharp increase in 
system accuracy when level five calculations were chosen 1s 
the fact that level five knowledge seems to represent a 
sharp increase in the amount of the original information 
content of the database retained by the database abstract. 
The lower level database abstracts contain information about 
the frequency of occurrence of certain values in each parti- 
tion set, but no information about the values to which these 
frequencies refer. The usefulness of comparisons between 
the database abstract information stored about component 
sets in a union or intersection is limited because there is 
no way of knowing to which values these frequencies refer. 
In some cases, worst case bounds must assume that the 
frequencies refer to the same value and in other cases the 
worst case assumpticn is that the sets are disjoint. 
Because bounds must be calculated using worst case assump- 
tions, the lack of information about the values to which the 
database abstract frequencies refer resultS ina serious 
degradation in accuracy of system responses. The tagged 
frequency lists of level five ccntain these values and asso- 
ciate them to the frequency with which they occur. This 
added tag informaticn appears tc raise the real information 
content of the database abstract to a plane much higher than 
the previous levels. These results raise the interesting 
question of whether an intermediate level of knowledge ought 
to contain the mode value aS well as the mode frequency and 
the second most frequent value together with its frequency. 
Perhaps, rules more Similar to the level five rules could be 
developed to improve the accuracy of the system responses 
without a significant degradation of system response time. 
This question is discussed in mcre detail in chapter IV. 

The second curve of figure 3.1 Shows the steady increase 
in the time required to complete a system guery as the level 


of knowledge is increased from cne to five. The increase in 
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this system characteristic is Buch more constant than the 
increase in system accuracy. This trait indicates that the 
jump in accuracy encountered in using level five information 
is not matched by a Similar increase in response time. 


However, figure 3.2 shows that the storage area required for 
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the database abstracts for the various levels of knowledge 
also increases at a constant rate for levels one through 
five. The average response time for the ten gueries 1s also 
shown on figure 3.2 to demonstrate that both of these char- 
acteristics increase at relatively the same rate as the 
level of knowledge is increased. Figure 3.3 shows a plot of 
storage reguired for the various database abstracts versus 
the average response times of the ten gueries tested for the 
Same database abstracts. This figure demonstrates that the 
increase in both storage requirements and response times is 
relatively constant as the level of knowledge 1s increased 
from one to five. We can conclude from this graph that the 
tradeoff made for greater accuracy at higher levels of 
knowledge results in comparable penalties of increased 
response times and database abstract storage requirements. 
From the preceding discussion, we can conclude that if 
the user needs relatively tight bounds on the size of the 
set being queried and 1s not overly concerned with the 
response time or the memory reguired to execute the guery, 
level five is most definitely the optimum level of knowledge 
to use. However, if he is limited severely by the amount of 
memory available in which to stcre the database abstract, he 
may wish to choose to use a lower level of knowledge. With 
this option, however, he will suffer a large penalty in the 
accuracy of the answer calculated by the systen. If memory 
is not a Major concern, the user will obtain the most accu- 
rate results per second of response time using level five 


knowledge. 
C. GRANULARITY OF THE DATABASE ABSTRACT 
1. Varying Granularity of All Attributes 


The decision as to which level of knowledge to use 


for a given query is one which is made by the user. Because 
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Figure 3.3 Space Versus Time Plot 
for Various Granularities. 


all five levels are available tc the user at any given tine, 
the user may choose one level fcr one guery to obtain satis- 
factory results and a different level for a subsequent 
query. On the contrary, the decision as to the granularity 
of the database abstract is made by the database adminis- 
trator and cannot be changed unless the database adminis- 
trator recalculates the entire database abstract from the 
original database. Because this system 1S primarily 
designed to be used with very large databases or when access 
to the actual database is not available, this decision is 
much more permanent than the one discussed in the previous 


section. 


Se) 


The database abstract generation subsystem of DAQUES 
is capable of constructing a database abstract with a 
different granularity used for each of the numeric attri- 
butes. We will attempt here to discover what characteris- 
tics of a particular attribute tend to make a finer 
granularity more advantageous in terms of accuracy and to 
make that improved accuracy worth the penalty of increased 
system response time and increased storage requirements. 

Figure 3.4 shows the differences between the upper 
and lower bounds, averaged over the ten gueries, for the 
five levels of knowledge. The different curves represent 
granularities of two, four, six, and eight partitions for 
all of the attributes in the sample database. These curves 
demonstrate that the accuracy cf resultS increases signifi- 
cantly as the granularity of all of the attributes is 
changed from two to four. However, the increase in accuracy 
as the granularity is shifted from four to six and from six 
to eight is less significant. Figure 3.5 shows that the 
increase in average system respcnse time is fairly constant 
as the number of partitions in the database abstract is 
increased from two to eight. Figure 3.6 shows that the 
increase in storage reguired for the various database 
abstracts is relatively constant as the granularity is made 
finer. 

Figure 3.11 shows the storage requirements of the 
various mixes of the database abstract for levels of knowl- 
edge one through five. From this figure we can see that 
partitioning an attribute more finely results ina very 
predictable increase in the size of the database abstract. 
For leveis one through three doubling the number of parti- 
tions for the numeric attributes nearly doubles the storage 
requirements of the database abstract. For levels four and 
five, doubling the number of partitions results ina more 
finely partitioned database abstract approximately one and 
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one half times the size of the more coarsely partitioned 
one. This smaller increase in the storage requirements at 
higher levels is due to the fact that many of the values in 
the frequency lists of these two levels are merely redistri- 
buted as the attibutes are partitioned more finely. 

From these curves we can conclude that increasing 
the number of partitions on all attributes in the database 
abstract causeS a steady increase in the penalties of 
greater system response time and storage requirements but 
results in smaller and smaller improvements in the accuracy 
achieved. Therefore it appears that at a certain point the 
penalties incurred by dividing the database into more 
partitions is not worth the small increase in accuracy of 
query responses obtained. Although these conclusions have 
been drawn ‘from experimentation on only one databse, we 
appears resonable to assume that the results could be gener- 
alized to other databases. Considering this restiction on 
the usefulness of increasing the granularity of all the 
attributes of the database abstract, it seems advantageous 
to investigate whether dividing only certain attributes 
more finely and leaving the others less finely partitioned 
will result in nearly the same accuracy improvements for a 
smaller penalty in response tize and size of the database 


abstract. 


2. Varying Granularity of Cnily One or Iwo Attributes 


To investigate the advantages of partitioning 
certain attributes more finely than others, the system was 
tested with seven different mixes of partitioning granu- 
larity. For each of the numeric attributes in the database, 
a database abstract was created with the given attribute 
divided into eight partitions and the remaining attrikutes 
divided into only four. Two additional database abstracts 


(mix and nmix) were also created, ‘mix' dividing the temp 
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and chol attributes into eight partitions and the remaining 
humeric attributes into four partitions. The 'nmix' data- 
base abstract reversed this construction resulting in four 
partitions for the temp and chol partitions and eight parti- 


tions for the other other numeric attributes. Nalbw ec wk 
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provides an explanation of the granularities of the various 


attribute partitions in each of the database abstracts named 


TABLE II 


Breakdown of Various Mixes of DBAs 


DBA NAME NUMBER OF PARTITIONS FOR THE GIVEN ATTRIBUTE 
“ss |séPATNO) dSSEX.~—s«éDA'~=*“‘«é‘éPEM©=©©CHOLL—sé«éiPRREDS 
two a ne 
four | b 2 ~ = 4 4 
six 6 2 6 6 6 6 
eight 8 Z 8 8 8 8 
mix 4 Z 4 8 8 4 
nmix | 8 Z 8 4 4 8 
pamix | 8 2 4 4 4 uy 
dmix 4 Z 8 uy 4 4 
tmix | 4 z 4 8 4 4 
cmix 4 2 4 4 8 4 
pmix | 4 Ze 4 4 4 8 

in the figures in this chapter. Figures 3.7 through 3.9 


represent the accuracy results for three of the more inter- 
esting gueries tested on this system. Each figure shows the 
actual upper and lower bounds calculated for all seven of 
these database abstracts together with the actual value of 
the query calculated from the original database. 

Careful study of these figures shows that improve- 
ments in accuracy were shown only when one of the components 
of the intersection or union set being queried involved the 
attribute that was partitioned more finely. For example, 
figure 3.7 shows that the query requesting the size of the 
set of all males with temperatures between 37.8 and 40.8 
inclusive resulted in one response fron ail database 


abstracts with the temperature attribute divided into only 
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four partitions and a different, more accurate response when 
utilizing the catabase abstract with the temperature attri- 
bute divided into eight partitions. This result 1s also 
present, but is less obvious in those gueries where the 
gueried set involves two numeric attributes. In figure 3.9, 
the query involves the patno and temp attributes and the 
database abstracts which partition these attributes more 
finely, tmix, pamix, mix and nmix all result in bounds which 
are tighter than the other datakase abstracts. However, it 
is interesting to note that the database abstracts which 
partition the patno attribute more finely result ina ich 
greater improvement in accuracy than those which parti-ion 
the temp attribute more finely. The same phenomenon is 
present in figure 3.8 with a finer partitioning of the pred 
attribute rather than the chol attribute resulting in the 
greater improvement. 

It is suspected that because the temp and chol 
attributes are highly skewed (i.e. have a very large mode 
frequency), increasing the number of partitions for these 
attributes from four to eight dces not have the same ability 
to increase the information content of the database abstract 
aS with more evenly distributed distributions. In the 
highly skewed distributions, one particular value, the mode, 
will always occupy only one partition, possibiy along with 
several other values. Varying the granularity of the parti- 
tion cannot shift any of these mode values to separate 
partitions. Consequently, this very large partition, which 
contains a large percentage of the database tuples, will 
always be present and the amount of information that can be 
stored about it is independent of the granularity of the 
database abstract. Dividing the partition nore finely for 
such attributes only resultS in increasing the information 
content of the remaining partitions which represent only a 


modest percentage of the entire database. 
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Another reascn for the increased accuracy of systen 


response with database abstracts which are partitioned more 


finely is that the union of actual partition sets chosen to 
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Figure 3.9 Accuracy of Query 9 Varying 
the er aubrey of DBAs : 


cover the input set defined by a range of attribute values 
is likely to fit that range more closely than it would using 
a database abstract fartitioned more coarsely. Figure 3.10 
demonstrates this point graphically. In this figure, we see 
the responses to system guerieS onan increasingly larger 
range of the numeric attribute, temperature. When the 
queried range 1S very small, say 40 to 40.8, the minizal 


union of actual partition sets for both database abstracts 
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consists of only one set. However, the one set for the 


finer partition is much smaller than that for the coarser 
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one. Therefore the calculated upper bound for the query, 
which is based solely on this minimal union, is much closer 
to the actual size of the set when a finer partitioning is 
used for the database abstract. In the same way, the calcu- 
lated lower bound is E-ased on the maximal union of partition 
sets which is covered by the desired range. For a very 
small queried range, this union is the empty set, producing 
a lower bound of zero. However finer partitioning of the 
database abstract results in smaller partitons and conseg- 
uently a given yueried range is more likely to cover such 
smaller sets than the larger ores resulting from a coarser 
partitioning. So, for a randcm range queried by the user, 
the finer the partitioning of the database abstract, the 
more closely the unions actually used for calculating the 
upper and lower bounds will be to the queried range and 
conseguently the more accurate the results will be. Note 
that figure 3.10 shows the system calculating the same upper 
and lower bounds for all eight query ranges tested when the 
coarse database abstract was used. When the finer parti- 
tioning was selected, however, the system was able to calcu- 
late several different bounds and therefore was able to 
discriminate between these varicus queries tested. 

Returning to the three example gueries of figures 
3.7 through 3.9, if we examine the response times for these 
calculations, we find that the level five calculation for 
the query shown in figure 3.7 ranged from 6.5 to 7.5 seconds 
for the database abstracts which divided the temp attribute 
into only four partitions, while response times of 18.6 to 
25.2 were experienced for those database abstracts with the 
temp attribute partitioned eight ways. The time 18.6 
resulted from the database abstract which only divided the 
temp attribute into eight partitions while the 25.2 time 
resulted from the database abstract which partitioned all 


the attributes eight ways. So it seems that increasing the 
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granularity of even a single attribute in the database 
abstract will result in slower response times on the same 
order of magnitude as increasing the granularity of ali the 
attributes. Note also that doubling the number of parti- 
tions in the database abstract results in considerably more 
than doukling the system response time. This characteristic 
is due to the fact that response time is not only a function 
of the size of the database abstract but also of the 
complexity of the input set. A set specified by a range 


such as: 
male * chol (115,270) 


will be preprocessed into form a) below for for a database 
abstract with four partitions and into form b) fora data- 


base abstract with eight partitions. 


a) and({male,or ({ chol(1) ,chol(2) ,chol(3) }) ]). 
b) and({male,or ({ chol(1) ,chol(2),chol (3), 
chol(4),chol(5),chol (6) }) ]}). 


Obviously the set of form b) Will require much tlonger to 
process than form a). Similarly, when only the gueried 
attrirute is partitioned more finely, the input set is still 
preprocessed into form b) and the complexity of the set 
causes the system response tite to approach the response 
time of the system when the database abstract with all 
attribute partitioned finely is used. If the query set does 
not involve the single attribute which is partitioned more 
finely, then the system response time is closer to the 
response time of the system using the databse abstract with 
all of the attributes partitioned more coarsely. HOWEVEL, 
keep ir mind that a finer partitioning of a single attribute 
will result in increased accuracy only for those gueries 
which involve the attribute partitioned more finely. From 


the above discussion we can conclude that if the user 


70 


community needs additional accuracy and a single attribute 
predominates the system gueries, it may be worth mixing the 
granularities of the various attributes inthe database 
abstract to obtain the optimum accuracy at the minimum cost 
in system response time. 

Figure 3.11 shows the storage reguirements for the 
database abstracts of table II. Inspection of this figure 
reveals that increasing the granularity for only one attri- 
bute in the database abstract results in a much smaller 
increase in the storage required than increaSing the granu- 
larity of all the attributes. Therefore, if primary memory 
available is a major concern, dividing of only the most 
commonly used attributes into finer partitions may be the 
optimum method of constructing the database abstract. 

From the above paragraphs, we can conclude that 
there iS no single optimum partitioning strategy which will 
be advantageous for all queries. However, if the database 
administrator iS aware of a particular attribute which is 
involved in a large percentage cf the system queries, he may 
decide to divide that attribute into more partitions than 
the other attributes. For this reason, the query subsysten 
of the DAQUES system allows the user access to three 
distinct database abstracts for his gueries. The intent of 
this arrangement is to provide a database which is rather 
coarsely partitioned as the ‘normal' database abstract, 
resuiting in rather loose upper and lower bounds but more 
rapid response. If the user chooses the ‘alternate! data- 
base abstract, he will be uSing one in which all the attri- 
butes are partitioned more finely but the response time for 
each query will be much Slower. Finally, the user may 
choose the ‘'special't database abstract which is partitioned 
finely on one or two commonly gueried attributes and more 
coarsely on the remaining attributes. This database 


abstract will result in more accurate results than the 


71 


" 


RR ae aa a area an ag on a = 


1002 = 
a) 
= elght -- '* 
LOuUL == ep 
haves 
nix =a 2 
pamix,dmix, 
S0a CM1x, 
60-{- 
S (K 
ie: 
oY | 
jee 3k 
A E [- 
Go) 
E 
x 
4O-j- 
ZO Se 





vai st -- 2 
pmix 





SSS a ae Se > se ee ee eee eee eee eee ee eee 


Figure 3.11 Storage 
Granularities 


Requirements for DBA*'s of Various 
for the Medical Database. 


Vz 


‘normal’ database abstract for only those queries involving 
the ‘special’ attribute or attributes and the response time 
will be less than that of the same guery executed with the 
"alternate" database abstract. Because the system only 
consults one of these database abstracts into the PROLOG 
internal database at a time, the user does not pay any added 
price in storage reguirements other than the secondary 
storage reguired to hold the three separate database 
abstracts. Because secondary storage iS uSually readily 
available compared to primary storage in the form of the 
PROLOG internal database, this penalty seems a small price 
to pay for the added flexibility of this systen. 


D. USE OF ALIASES TO IMPROVE SYSTEM ACCURACY AND RESPONSE 
TIME 


The previous section analyzed a method with which the 
database administrator could tailor the database abstract to 
the specific requirements of a particular group of users by 
partitioning more finely those attributes which are involved 
most frequently in the guery sets of the users. This 
section analyzes another method of tailoring the database 
abstract to users by partitioniny various attributes along 
boundaries which represent important sets in the application 
field and then assigning aliases to those partition sets in 
such a way that the aliases describe the sets. This method 
results in the database abstract which is the most specific 
to a class of users but also produces the most accurate 
results for gueries on the special aliased sets. 

The method of defining special boundaries for meaningful 
partition sets of a particular attribute was described in 
Chapter II in the section on the database abstract genera- 
tion subsystem of DAQUES. The technique outlined in that 


chapter was used to define partitions of the patno, da and 


us 


temp attributes, assuming that the boundaries used repre- 
sented meaningrul descriptions of low, middle and high 
ranges for those attributes. The remaining two numeric 
attributes were partitioned using the algorithm of the 
generation subsystem with no attempt to produce meaningful 
partition sets on those attributes. Figures 3.12 and 3.13 
Show the guery results of queries 4 and 6 of table I using 
the specialized database with only three partitions for each 
attribute. On separate curves are shown the results of the 
same queries using algorithmically generated database 
abstracts containing four and eéight partitions. A compar- 
ison o: these curves shows graphically the tremendous 
improvement in accuracy which results from using the alias 
technique. Because the specialized database abstract uses 
only three partitions for all the attributes, it requires 
much less storage than the database abstracts with combina- 
tions of four and eight partitions for the attributes. The 
System is able to evaluate gueries on specific partition 
sets much more efficiently than it can evaluate gueries on 
sets defined by arbitrary ranges of the attribute. This 
capability is caused by the method of substituting unions of 
actual partition sets for sets using specific ranges as 
described in chapter II. Therefore, the queries on the same 
sets executed much more rapidly with much more accurate 
results for this specialized database abstract than for the 
previous unspecialized database abstracts. These response 


times are demonstrated graphically in figures 3.14 and 3.15. 


E. CONCLUSIONS 


From the tests discussed in the previous section we can 
now develop severai conclusions about what the optimum data- 
base abstract woul. he. Recali that our goals for the 


systen, mentioned in chapter I, were to determine the 
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optimum level of knowledge, and the optimum granularity ofa 


database abstract for the query estimation system. Here, 
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Figure 3.15 Response Times for Query 6 
for the Specialized DBA. 


optimum refers to the system with most accurate responses, 


the shortest response times and requiring the least amount 


of primary storage for the database abstract. Of course, 
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all of these goals cannot be realized at the same time 
because they conflict with each other. Therefore tradeoffs 
must be made to produce the system which best suits’ the 
needs of all the users. 

First and foremost, it is very important that the data- 
bkase administrator consider which attributes and sets are 
most commonly queried in the application field. If one 
particular attribute will be involved ina large percentage 
of the system queries, that attribute should be partitioned 
more finely than the other attributes. The guestion of 
exactly how finely the partitioning should be has not been 
determined through these tests. However it is suspected 
that there exists a granularity for each attribute which is 
optimum in the sense that partitioning the database more 
finely on this attribute results in a sharp drop in the 
improvement in accuracy achieved while the increase in 
system response time at least remains constant. It is also 
believed that this optimum granularity is a function of the 
ratio or the mode frequency of the attribute in the database 
to the size of the database. For exampie, a highly uneven 
distribution in which over half of the tuples of the data- 
kase have the mode value will have a much lower optinun 
granularity than a distribution in which the values of the 
attribute are evenly distributed over the range of values of 
the attribute. Of course, the number of distinct values in 
the database for each attribute is an upper bound on the 
optimum granularity for that attribute because if each 
partition contained tuples with only one value for the given 
attribute, any finer partitioning on that attribute would 
not be possible. 

Another way which the datakase administrator should be 
familiar with the afplication field is to identify some 
partitions of database attributes which are meaningful in 


the application field and therefore are likely to be used 
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often in 2k queries. The example used in chapter I was 
thet of 2 low, normal and high temperature ranges. If 
such meaningful partitions are identified, the database 
administrator can define the database abstract partitions to 
correspond to the application area partitions of the attri- 
bute and result in greatly improved accuracy and _ systen 
response tine. 

The optimum granularity of the database abstract, there- 
Tore depends on the needs of the user. If the database 
administrator must provide a system to serve the needs of a 
variety of users, he can use the nornal, alternate and 
special database abstracts to provide one general purpose 
and two specialized database abstracts. If this flexibility 
is still not sufficient, the database administrator must 
make tradeoffs between the needs of his various users to try 
to best serve the entire user ccmmunity. 

The question of the level of knowledge which must be 
used to obtain the optimum results for a particular query 
was determined not to be as important as the granularity of 
the database abstract. Because the user, rather than the 
database administrator chooses the level of knowledge to be 
used for each query, such a decision can easily te altered 
if the results obtained are not sufficiently accurate. 
However, the tests conducted indicated that either level one 
or level five were the optimum levels to use because these 
levels were capable of calculating new upper and lower 


bounds on both union and intersection sets. 


Fe. EFFECTS OF USING PROLOG TO IMPLEMENT THE SYSTEM 


It was mentioned in chapter II that PROLOG was chosen as 
the insplementation language for this system primarily 
because it was expected that this language would provide a 


rule base which was more logical and easily understood than 
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that wkich a Lisp system might frovide. This advantage was 
realized to even a greater extent than anticipated, but 
additional equally important advantages as well as several 
disadvantages to using PROLOG were also realized. Because 
this author's experience with Lisp is guite limited, we will 
concentrate here on the characteristics of PROLOG rather 
than attempting to ccmpare the two languages. 

The syntax of PROLOG (with some exceptions, e.g. ! and 
=..) is extremely natural for a rule based system. Rules 
can be constructed individually and translated into PROLOG 
clauses independent of the other system rules. The form of 
a PROLOG clause lends itself to working on one level of 
abstraction at a time using lower level calls to rules whose 
effect 1S envisioned but whose implementation has’ been put 
off until later. The primary advantage to the PROLOG syntax 
is the way it lends itself tc organizing a program into 
several files of rules, each of which is understandable 
independent of the others. Fcr example, a reader could 
examine the ‘infs' file and understand the rules for 
obtaining lower bounds on the input sets without ever 
knowing that the "'sups' or tests! files exist. 

On the other hand, it is guite possible that a reader 
would need to refer to the ‘master’ file to determine the 
effect of many low level functions which are used by the 
‘infs' file. This leads us to one of the disadvantages of 
the PROLOG system, that is the lack of many built in fuac- 
tions whose effect is already well understood by the user 
community. For example, the system does not have a ‘member' 
function to check for the membership of an individual 
element in alist or an t‘append' function to append two 
lists together. Even the most basic Lisp systems have such 
Piatt in functions. On the other hand, it is quite easy to 
write these functions in PROLOG and, once written, they can 


be used over and over again in various systems. So, except 
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for the penalties of lower system efficiency, we do not see 
this disadvantage as too great a problem. 

In [{Ref. 1], that author spent a great deal of his 
system's execution time determining which rules in his rule 
base to apply and ensuring that all the appropriate rules 
were in fact used for every query. These problems were a 
direct result of uSing Lisp as his programming language. 
PROLOG solves this problem automatically through its pattern 
Matching mechanisms. By providing a sufficient number of 
arguments in the upper level goals, the PROLOG system can 
ensure that when the appropriate values are substituted into 
these goals, only the clause heads of the desired rules will 
match the goal. Of those rules that match, only the first 
successful rule will affect the systen variables. 
Therefore, the programmer can dictate the priority of execu- 
tion of his rules by merely placing the rules in the proper 
order inthe rule base. PRCLOG also provides built in 
features such as the ‘cut' predicate to alter the normal 
search through the rule base whereas in other programming 
languages, such mechanisms would have to be built from prin- 
itives. 

Another important way in whicn PROLOG iends itself to 
implementing a rule-based system is in its automatic manage- 
ment of the internal PROLOG database of rules and facts. 
The system in this paper is constructed as if two separate 
storage areas existed for the original database abstract and 
the system generated cache. In fact, these rules and facts 
are all stored in the PROLOG heap. New items can be added 
to the heap with the ‘assert' command and unwanted facts or 
rules can be removed using the ‘retract' and ‘abolish? 
commands. In this way, converting a set union into a simple 
set with respect to a given level is just a matter of 
asserting a new set of facts into the internal database. 


These mechanisms to automatically manage the internal 
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database eliminate many of the details which would have to 
be worked out to build such a system in another programming 
language. 

Chapter IV discusses in some detail the possible exten- 
Sions to this system which could be effected to increase its 
functionality. These extensions are made possible primarily 
because of the independent execution of the individual 
PROLOG rules when these rules have distinct clause heads. A 
detailed explanation of this idea is deferred until chapter 
IV but suffice it to say here that this rule-based systen 
could easily be extended to answer queries on statistics 
other than ‘size'* andon sets other than unions and inter- 
sections. This flexibility is caused primarily by the 
structure inherent in a PROLOG rule base. 

A final advantage of using PROLOG to implement this 
rule-based system may not be fully realized for several 
years. With no conscience effcrt to do so, this system is 
built to be executed very efficiently ona network of 
parallel processors. For example, if a parallel conjunc- 
tion were used between the subgoals ‘dosup', ‘doinf', and 
"doest', these three independent calculations could be 
carried out in parallel and thus system response time could 
ke reduced considerably. Even more dynamic would be the use 
of such a parallel conjunction in the 'map' functional to 
cause parallel execution of the individual rules for all the 
attributes simultaneously on different processors. These 
two Simple changes in the system would reduce systen 
response time by about an order of magnitude. If a concen- 
trated effort were made from the start of system construc- 
tion, perhaps even greater improvements in response time 
could be realized. 

With all the above points considered, the decision to 
use PROLOG as the implementation language for this rule- 


based system appears to be a wise one. For a penalty of 
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slightly longer response times and more work to build basic 
functions, the PROLOG system provided the advantages of 
logical and easily understood syntax, automatic pattern 
Matching mechanisms fcr rule apflication, easy-to-use tools 
for management of the internal database of rules, and a 
system which can easily be expanded to increase its func- 
tionality. Both of the disadvantages of PROLOG stated above 
are likely to be overcome in the near future when more wide- 
spread use of this language will bring about a larger 
library of built-in system functions and more research into 
implementing PROLOG in parallel architectures wiil dramati- 


cally improve system response times. 
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IV. POSSIBLE EXTENSIONS/IMEROVEMENTS TO THE SYSTEM 
A. EXTENDING THE SYSTEM TO ACCEPT QUERIES ON OTHER 
Saar STICS 


The DAQUES system was written with many seemingly unnec- 
essary arguments being passed from clause to clause in the 
higher level rules. For example, in all three ‘do' rules, 
only one argument is currently needed by the system, that is 
the set being queried. However, the rules also have places 
for passing statistic and attribute arguments. Currently, 
only rules for the "size" statistic exist in the database of 
rules. All these rules have clause heads which appear like 


the following: 


dosup (Size,and (Set_List), _, Answer) 
or 


dosup (size,or(Set_list) ,_,Answer) 


Here, the third argument would contain the attribute of the 
Statistic to be queried. Since the ‘size' statistic is not 
associated with any particular attribute, this argument 
remains anonymous in the current systen. 

In order to extend this system to accept gueries on 
Other statistics of the anput set, the programmer would 
Simply have to add a new file of rules to supplement the 
"sups', ‘infs', and ‘ests' files. For example, to allow the 
system to request upper and lower bounds and eStimates on 
the mean value of a particular attribute in the gueried set, 
the programmer would simply have to add rules with clause 


heads of the form: 


dosup (mean,and(Set_List) ,Attribute, Answer) 
and 


doinf (mean,and (Set_List) ,Attribute,Answer) 


Co 


and 


doest (mean, and (Set_List), Attribute, Answer) 


Similar rules would have to be added for set unions. These 
new rules would in no way interfere with the current systen. 
No hidden side effects from these new rules could possibly 
alter the execution of the current rules because the call to 
the appropriate ‘dot rule would only match the rule in the 
rule base which had the proper statistic as its first argu- 
ment. Cf course, these new rules could use many of the 
already written general rules which are located in the ‘'mas- 
ter' file, thus saving the programmer much wasted effort. 

It is also anticipated that little or no degradation in 
system response time would be experienced by augmenting the 
system in this way. Once the initial invocation of the ‘do! 
clause chose the proper rule to use for the statistic being 
queried, the rules for the other statistics should have no 
effect on the speed of execution. Of course, the added 
rules would increase the space needed to hold the system in 
the PROLCG internal database. If space became a problen, 
the programmer could sacrifice response time for added space 
by aly consulting the rules for the specified statistic and 
th . retracting these rules from the rule-base and asserting 
a different set of rules if a different statistic is 
gueried. 

This system could be extended to accept queries on any 
Statistic provided that the appropriate rules could be 
devised to obtain useful bounds and estimates on those 
statistics. Additicnally, the system could be expanded to 
accept sets other than the unions and intersections which it 
Currently accepts. For example rules with the following 
forms could be added to allow the system to calculate upper 


bounds on sets expressed as complements of simple sets: 


dosup (Size, comp (Bigset,Set_List),_,Answer) 


and 
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dosup (std_dev,comp (Bigset,Set_List) ,Attribute, Answer) 


Here the notation signifies that the desired query set is 
made of all tuples from the Bigset which are not also 
members of any of the sets in the Set_List. 

The important point of this section is not to offer 
specific suggestions as to how to extend the current systen 
but to demonstrate how the flexibility of the current systen 
lends itself to extension. A large part of this flexibility 
is made possible by the pattern matching ability of the 
PROLOG language. To construct such a prototype system with 
the same capability for expansion in a conventional impera- 


tive programming language would be practically impossible. 


B. IMPROVEMENTS IN SYSTEM ACCURACY 


It was discussed in chapter II that a _ significant 
improvement in system accuracy could be achieved if the 
rules for determining upper and lower bounds’ on the levels 
two through four information for set unions could be 
improved. Remember that whenever a set 1S processed by a 
[je rile, the rule must make that set into a simple set 
with respect to the current level of knowledge by asserting 
into the internal database uppfer and lower bounds’ on the 
same level information which is stored about the actual 
partition sets. After a set union is processed by the level 
two rules, the upper bound on the mode frequency and number 
of distinct values which has been asserted into the PROLOG 
internal database is the sum of these statistics for each of 
the component sets in the union. This is a terrible bound 
on these statistics and when this union set iS a component 
set in some higher level union or intersection, the inaccu- 
racy of these bounds is passed on to the calculation of the 
bounds on the size of the higher level set. Because of the 


method for converting input sets specified by a range of 
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values of an attribute into disjoint uniob® of partition 
sets of the attribute, anew tighter bounc on the calcula- 
tion of the lower level statistics would significantly 
improve the accuracy of the bounds calculated for the size 
of the higher level set. 

Two general methods for improving the accuracy of the 
bounds on these level informaticn statistics seem promising. 
The first is to test the lcwer level union to detect 
disjointness and to handle such disjoint sets differently 
than sets which are not disjoint. The problem with this 
approac.i#is that it only shows an improvement for the parti- 
tioning attribute's calculations. For the other attributes, 
deterMining that the sets have no tuples in common does not 
guarantee that they do not have any values of the non- 
partitioning attribute in common. For example, in the 
Sample medical database, if we were dealing with the two 
partition sets patno(1) and patno(2), we could state with 
confidence that the mode frequency patno of their union was 
the maximum of the mode frequencies for patno in the two 
component sets because these sets clearly are disjoint. 
However, consider the cholestercl attribute. The mode value 
for cholesterol in both of these sets is the same even 
though the sets are disjoint. Therefore, if we used the 
same Ccrule for disjoint sets, the calculated upper bound on 
the mode frequency of the union would be much less than the 
actual mode frequency. 

The root of the problem in the above paragraph is the 
fact that for the non-partitioning attribute in the compo- 
nent sets of a union or intersection, we cannot determine 
whether the mode values of the sets are the same when we are 
only provided with the mode freguency. The second approach 
to im oving accuracy of these calculations is to augment 
the datakase axwstract with the actual mode value in addition 


to the other statistics. This approach would increase tne 
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Size of the database abstract ry about one third but pfrob- 
ably would increase the information content of the database 


abstract by much nore. 


C. IMPROVEMENTS IW SYSTEM EFFICIENCY 


As stated in chapter I, speed of execution of the systen 
waS not a primary ccncern during its construction. The 
purpose of a prototype system is not to produce the nost 
efficient system but to prove the feasibility of the 
approach. Conseguently, there is much room for improvement 
in the efficiency of this systen. PupsewOrwalit, the €xper— 
tise of the author in developing tight PROLOG code (if such 
a thing exists), matured a great deal from the early to the 
late stages of the progran. Many rules exist in the 
program, in which a cut predicate might greatly improve the 
speed of execution of the system. These cut predicates have 
been inserted in many rules but no determined effort has 
been made to optimize the code in this way. 

Another area which could increase the efficiency of the 
program is a shift in the overall structure of the highest 
level predicates. As stated many times previously, part of 
the purpose of every ‘dot rule is to produce the side effect 
of converting the input set into a simple set with respect 
to the current level of knowledge in case this set isa 
component set ina higher level union or intersection. This 
system inherently wastes a great deal of execution time 
calculating upper and lower bcunds on levels two through 
five statistics for the highest level sets. 

One solution to this wastefulness would be to provide 
different sets of rules to deal with the highest level sets 
and the lower level sets. For example, the ‘dosup2low' rule 
would calculate the upper bound on the input set as well as 


upper bounds on the mode frequency and number of distinct 
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values of the input set because it is certain that this set 
is part of a higher level union or intersection. However, 
the ‘dosup2hi't rule would only calculate the upper bounds on 
the size of the input set because this rule is only called 
for the highest level sets. These rules could share many of 
the lower level rules which they call in order to minimize 
the increase in the size of the rule base for this revised 


systen. 


D. SUAMARY 


The DAQUES system 1s designed primarily to demonstrate 
how the user of a database can be provided with a hierarchy 
of knowledge to produce increasingly more accurate bounds 
and estimations for his queries. A primary goal of the 
system is to provide the most rapid response to users! 
gueries possible, with the most accurate results possible 
wheh a user may not require exact answers to hiS queries or 
May not have access to the actual database containing that 
information. The improvements to the basic DAQUES systen 
discussed in this chapter will help the system to better 


achieve these response time and accuracy goals. 
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